fix(merge): make candidate dedup anchor-insensitive (align with status-preservation keys) #136
Merged
…s-preservation keys)

normalize_rows keyed dedup on the full source string including any #anchor, while every status-preservation key (superseded / engine / review) strips the anchor. That asymmetry kept the same (subject, relation, object) re-extracted from the same file with a drifted/added anchor as two candidate rows, even though the status layer treats them as one fact.

Dedup now keys on (subject, relation, object, source-file), stripping the anchor like the preservation keys. On collision the surviving row is chosen deterministically: the full source that sorts lexicographically first wins, so a bare path beats any anchored variant and otherwise the first anchor wins, independent of input order. Source-existence validation and NFC normalisation are untouched.

Closes #135
Summary
Fixes the lone anchor-sensitive key in the merge path.
`normalize_rows` deduped raw candidates on the full `source` string, including any `#anchor`, while every status-preservation key in the same file (`existing_superseded_keys`, `existing_engine_keys`, `existing_review_keys`) deliberately strips the anchor. That asymmetry kept the same `(subject, relation, object)` re-extracted from the same file with a drifted, added, or invalid anchor as two candidate rows, even though the status layer treats them as one fact. The result was two candidate rows under one status identity.

Change
- Dedup now keys on `(subject, relation, object, source.partition("#")[0])`, which is anchor-insensitive and matches the preservation keys.
- On collision, the full `source` that sorts lexicographically first wins. A bare `sources/a.md` is a prefix of (thus `<`) any `sources/a.md#anchor`, so bare beats anchored; otherwise the lexicographically-first anchor wins. The winner is fixed by value, independent of input row order (no longer incidental to iteration).
- Source-existence validation (the `partition("#")` check) and NFC normalisation are unchanged.

This intentionally collapses cross-section provenance, consistent with how the status layer already treats anchors as insignificant. Per-section provenance, if ever wanted, should be a separate deliberate feature.
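For reviewers, the keying and survivor rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `dedup_candidates`, the dict-shaped rows, and the output ordering are assumptions made for the example, and source-existence validation and NFC normalisation are left out.

```python
def dedup_candidates(rows):
    """Hypothetical stand-in for the dedup step inside normalize_rows.

    Rows are assumed to be dicts with "subject", "relation", "object"
    and "source" keys; the real row type may differ.
    """
    survivors = {}
    for row in rows:
        source = row["source"]
        # Anchor-insensitive key: keep only the part before the first "#".
        key = (row["subject"], row["relation"], row["object"],
               source.partition("#")[0])
        kept = survivors.get(key)
        # The lexicographically smallest full source wins, so the result
        # depends on the values only, not on the order rows arrive in.
        if kept is None or source < kept["source"]:
            survivors[key] = row
    # Assumption: output follows first-seen key order (dict insertion order).
    return list(survivors.values())


rows = [
    {"subject": "A", "relation": "uses", "object": "B", "source": "sources/a.md#setup"},
    {"subject": "A", "relation": "uses", "object": "B", "source": "sources/a.md"},
    {"subject": "A", "relation": "uses", "object": "C", "source": "sources/a.md#intro"},
    {"subject": "A", "relation": "uses", "object": "C", "source": "sources/a.md#api"},
]

surviving_sources = [r["source"] for r in dedup_candidates(rows)]
reversed_sources = sorted(r["source"] for r in dedup_candidates(rows[::-1]))
```

In this sketch the bare `sources/a.md` beats `sources/a.md#setup` because a string compares less than any longer string it is a prefix of, and `#api` beats `#intro` by plain string comparison. Feeding the rows in reverse order leaves the same two sources standing.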
Tests
- `tests/unit/test_dedup_anchor.py`: bare-vs-anchored collapse, two-anchor collapse, order-independence of the surviving source, and distinct-triple retention, each asserting exactly which `source` survives.
- `tests/test_*.sh` shell harnesses pass; golden engine regression passes; `ruff` clean.

Closes #135